Abstract

Imagine being shown N samples of random variables drawn independently 
from the same distribution. What can you say about the distribution? In 
general, of course, the answer is nothing, unless you have some prior notions 
about what to expect. From a Bayesian point of view one needs an a priori 
distribution on the space of possible probability distributions, which defines 
a scalar field theory. In one dimension, free field theory with a normalization 
constraint provides a tractable formulation of the problem, and we discuss 
generalizations to higher dimensions. 






As we watch the successive flips of a coin (or the meanderings of stock prices), we ask 
ourselves if what we see is consistent with the conventional probabilistic model of a fair 
coin. More quantitatively, we might try to fit the data with a definite model that, as we 
vary parameters, includes the fair coin and a range of possible biases. The estimation of these 
underlying parameters is the classical problem of statistical inference or 'inverse probability,' 
and has its origins in the foundations of probability theory itself. But when we observe
continuous variables, the relevant probability distributions are functions, not finite lists of
numbers as in the classical examples of flipping coins or rolling dice. In what sense can
we infer these functions from a finite set of examples? In particular, how do we avoid the 
solipsistic inference in which each data point we have observed is interpreted as the location 
of a narrow peak in the underlying distribution? 

Let the variable of interest be $x$ with probability distribution $Q(x)$; we start with the one
dimensional case. We are given a set of points $x_1, x_2, \ldots, x_N$ that are drawn independently
from $Q(x)$, and are asked to estimate $Q(x)$ itself. One approach is to assume that all possible
$Q(x)$ are drawn from a space parameterized by a finite set of coordinates, implicitly excluding
distributions that have many sharp features. In this case, it is clear that the number of
examples $N$ can eventually overwhelm the number of parameters $K$. Although the finite
dimensional case is often of practical interest, one would like a formulation faithful to the
original problem of estimating a function rather than a limited number of parameters.

No finite number of examples will determine uniquely the whole function $Q(x)$, so we
require a probabilistic description. Using Bayes' rule, we can write the probability of the
function $Q(x)$ given the data as

$$ P[Q(x)|x_1, x_2, \ldots, x_N] = \frac{P[x_1, x_2, \ldots, x_N|Q(x)]\,P[Q(x)]}{P(x_1, x_2, \ldots, x_N)} \qquad (1) $$

$$ = \frac{Q(x_1)Q(x_2)\cdots Q(x_N)\,P[Q(x)]}{\int [dQ(x)]\,Q(x_1)Q(x_2)\cdots Q(x_N)\,P[Q(x)]} , \qquad (2) $$

where we make use of the fact that each $x_i$ is chosen independently from the distribution
$Q(x)$, and $P[Q(x)]$ summarizes our a priori hypotheses about the form of $Q(x)$. If asked for
an explicit estimate of $Q(x)$, one might try to optimize the estimate so that the mean square
deviation from the correct answer is, at each point $x$, as small as possible. This optimal
least-square estimator $Q_{\rm est}(x; \{x_i\})$ is the average of $Q(x)$ in the conditional distribution of
Eq. (2), which can be written as

$$ Q_{\rm est}(x; \{x_i\}) = \frac{\langle Q(x)Q(x_1)Q(x_2)\cdots Q(x_N)\rangle^{(0)}}{\langle Q(x_1)Q(x_2)\cdots Q(x_N)\rangle^{(0)}} , \qquad (3) $$

where by $\langle\cdots\rangle^{(0)}$ we mean expectation values with respect to the a priori distribution
$P[Q(x)]$. The prior distribution $P[Q(x)]$ is a scalar field theory, and the n-point functions
of this theory are precisely the objects that determine our inferences from the data.
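
As a finite illustration of the ratio of prior expectation values that defines the least-square estimator, the sketch below replaces the functional integral by a sum over a small, hypothetical ensemble of candidate distributions on three discrete outcomes. The ensemble, the uniform prior weights, and the sample list are assumptions chosen only for illustration; they are not part of the original analysis.

```python
from math import prod

def posterior_mean_estimate(candidates, prior, samples):
    """Least-square estimate: <Q(x) prod_i Q(x_i)> / <prod_i Q(x_i)> under the prior."""
    # Likelihood of the observed samples under each candidate distribution.
    likelihoods = [prod(q[x] for x in samples) for q in candidates]
    # Denominator: prior average of the likelihood.
    evidence = sum(p * lik for p, lik in zip(prior, likelihoods))
    n_outcomes = len(candidates[0])
    # Numerator, outcome by outcome: prior average of Q(x) times the likelihood.
    return [
        sum(p * lik * q[x] for p, lik, q in zip(prior, likelihoods, candidates)) / evidence
        for x in range(n_outcomes)
    ]

# Hypothetical toy ensemble: three candidate distributions over outcomes {0, 1, 2}.
candidates = [
    [0.6, 0.3, 0.1],
    [1 / 3, 1 / 3, 1 / 3],
    [0.1, 0.3, 0.6],
]
prior = [1 / 3, 1 / 3, 1 / 3]
samples = [0, 0, 1, 0, 2, 0]

q_est = posterior_mean_estimate(candidates, prior, samples)
print(q_est)
```

Because every candidate is normalized, the estimate is automatically normalized as well; in the continuum problem the same property is enforced by the constraint on the prior.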

The restriction of the distribution $Q(x)$ to a finite dimensional space represents, in the
field theoretic language, a sharp ultraviolet cutoff scheme. Several authors have considered
the problem of choosing among distributions with different numbers of parameters, which
corresponds to assuming that the true theory, $P[Q(x)]$, has a hard ultraviolet cutoff whose
unknown location is to be set by this choice. As in field theory itself, one would like to have
a theory in which one can remove the cutoff without any unpleasant consequences. Our
Bayesian approach will provide this.

The prior distribution, $P[Q(x)]$, should capture our prejudice that the distribution $Q(x)$
is smooth, so $P[Q(x)]$ must penalize large gradients, as in conventional field theories. To
have a field variable $\phi(x)$ that takes on a full range of real values ($-\infty < \phi < \infty$), we write

$$ Q(x) = \frac{1}{\ell}\exp[-\phi(x)] , \qquad (4) $$

where $\ell$ is an arbitrary length scale. Then we take $\phi$ to be a free scalar field with a constraint
to enforce normalization of $Q(x)$. Thus $\phi(x)$ is chosen from a probability distribution



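$$ P_\ell[\phi(x)] = \frac{1}{Z}\exp\left[-\frac{\ell}{2}\int dx\,(\partial_x\phi)^2\right]\delta\left[1 - \frac{1}{\ell}\int dx\,e^{-\phi(x)}\right] , \qquad (5) $$

where we write $P_\ell[\phi(x)]$ to remind us that we have chosen a particular value for $\ell$; we will
later consider averaging over a distribution of $\ell$'s, $P(\ell)$. The objects of interest are the
correlation functions: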



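$$ \langle Q(x_1)Q(x_2)\cdots Q(x_N)\rangle^{(0)} = \int D\phi\,P[\phi(x)]\prod_{i=1}^{N}\frac{1}{\ell}\exp[-\phi(x_i)] \qquad (6) $$

$$ = \frac{1}{\ell^N}\frac{1}{Z}\int\frac{d\lambda}{2\pi}\int D\phi\,\exp[-S(\phi;\lambda)] , \qquad (7) $$

where, by introducing the Fourier representation of the delta function, we define the action

$$ S(\phi;\lambda) = \frac{\ell}{2}\int dx\,(\partial_x\phi)^2 + i\frac{\lambda}{\ell}\int dx\,e^{-\phi(x)} + \sum_{i=1}^{N}\phi(x_i) - i\lambda . \qquad (8) $$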
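
To get a feeling for the prior of Eq. (5), one can draw sample densities on a lattice. The sketch below is only illustrative and rests on assumptions that are not in the text: a finite interval `[0, length]` with free boundaries, a uniform grid, and the hypothetical helper name `sample_prior_density`. On a grid of spacing `dx`, the weight $\exp[-(\ell/2)\int dx\,(\partial_x\phi)^2]$ makes the increments of $\phi$ independent Gaussians with variance `dx / ell`; the normalization constraint is then imposed by shifting the uniform component of $\phi$.

```python
import math
import random

def sample_prior_density(ell, length, n_grid, rng):
    """Draw one lattice sample of Q(x) = exp(-phi(x)) / ell on [0, length]."""
    dx = length / n_grid
    # Free field on a lattice: phi is a random walk with step variance dx / ell.
    phi = [0.0]
    for _ in range(n_grid - 1):
        phi.append(phi[-1] + rng.gauss(0.0, math.sqrt(dx / ell)))
    # Shift the uniform component of phi so that the integral of Q equals 1.
    total = sum(math.exp(-p) / ell for p in phi) * dx
    shift = math.log(total)
    phi = [p + shift for p in phi]
    q = [math.exp(-p) / ell for p in phi]
    return phi, q, dx

rng = random.Random(0)
phi, q, dx = sample_prior_density(ell=2.0, length=1.0, n_grid=200, rng=rng)
print(sum(q) * dx)  # normalization of the sampled density
```

Larger values of `ell` suppress the increments of $\phi$ and therefore give smoother sample densities, which is the sense in which $\ell$ sets the scale of "too fast" variations.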

We evaluate the functional integral in Eq. (7) in a semiclassical approximation, which
becomes accurate as $N$ becomes large. Keeping only the configuration that extremizes
the action (the pure classical approximation, with no fluctuations) is equivalent
to maximum likelihood estimation, which chooses the distribution, $Q(x)$, that maximizes
$P[Q(x)|\{x_i\}]$. In our case, integration over fluctuations will play a crucial role in setting the
proper value of the scale $\ell$.

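The classical equations of motion for $\phi$ and $\lambda$ are, as usual,

$$ \frac{\delta S(\phi;\lambda)}{\delta\phi(x)} = \frac{\partial S(\phi;\lambda)}{\partial\lambda} = 0 , \qquad (9) $$

which imply

$$ \ell\,\partial_x^2\phi_{\rm cl}(x) + i\frac{\lambda_{\rm cl}}{\ell}e^{-\phi_{\rm cl}(x)} = \sum_{i=1}^{N}\delta(x - x_i) , \qquad (10) $$

$$ \frac{1}{\ell}\int dx\,e^{-\phi_{\rm cl}(x)} = 1 . \qquad (11) $$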

Integrating Eq. (10) and comparing with Eq. (11), we find that $i\lambda_{\rm cl} = N$, provided that
$\partial_x\phi(x)$ vanishes as $|x| \to \infty$. If the points $\{x_i\}$ are actually chosen from a distribution
$P(x)$, then, as $N \to \infty$, we hope that $\phi_{\rm cl}(x)$ will converge to $-\ln[\ell P(x)]$. This would
guarantee that our average over all possible distributions $Q(x)$ is dominated by configurations
$Q_{\rm cl}(x)$ that approximate the true distribution. So we write $\phi_{\rm cl}(x) = -\ln[\ell P(x)] + \psi(x)$ and
expand Eq. (10) to first order in $\psi(x)$. In addition we notice that the sum of delta functions
can be written as



$$ \sum_{i=1}^{N}\delta(x - x_i) = N P(x) + \sqrt{N}\rho(x) , \qquad (12) $$

where $\rho(x)$ is a fluctuating density such that

$$ \langle\rho(x)\rho(x')\rangle = P(x)\,\delta(x - x') . \qquad (13) $$

The (hopefully) small field $\psi(x)$ obeys the equation

$$ \left[\ell\,\partial_x^2 - N P(x)\right]\psi(x) = \sqrt{N}\rho(x) + \ell\,\partial_x^2\ln P(x) , \qquad (14) $$

which we can solve by WKB methods because of the large factor $N$:

$$ \psi(x) = \int dx'\,K(x,x')\left[\sqrt{N}\rho(x') + \ell\,\partial_{x'}^2\ln P(x')\right] , \qquad (15) $$

$$ K(x,x') \sim \frac{1}{2\sqrt{N}}\left[\ell P(x)P(x')\right]^{-1/4}\exp\left[-\int_{\min(x,x')}^{\max(x,x')}dy\,\sqrt{\frac{N P(y)}{\ell}}\right] . \qquad (16) $$

Thus the "errors" $\psi(x)$ in our estimate of the distribution involve an average of the
fluctuating density over a region of (local) size $\xi \sim [\ell/NP(x)]^{1/2}$. The average systematic error
and the mean-square random error are easily computed in the limit $N \to \infty$ because this
length scale becomes small. We find

$$ \langle\psi(x)\rangle = \frac{\ell}{N P(x)}\,\partial_x^2\ln P(x) + \cdots , \qquad (17) $$

$$ \langle[\delta\psi(x)]^2\rangle = \frac{1}{4}\frac{1}{\sqrt{N P(x)\ell}} + \cdots . \qquad (18) $$

Higher moments also decline as inverse powers of $N$, justifying our claim that the classical solution
converges to the correct distribution.
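
The local smoothing scale and the leading random error are simple enough to tabulate directly. The sketch below evaluates $\xi(x) = [\ell/NP(x)]^{1/2}$ and the leading term of Eq. (18) for a unit Gaussian $P(x)$; the Gaussian, the values of `n_samples` and `ell`, and the function names are illustrative assumptions.

```python
import math

def gaussian_pdf(x, sigma=1.0):
    """Gaussian density, used here only as an example of a true distribution P(x)."""
    return math.exp(-x * x / (2 * sigma * sigma)) / math.sqrt(2 * math.pi * sigma * sigma)

def smoothing_length(x, n_samples, ell, pdf):
    """Local smoothing scale xi ~ [ell / (N P(x))]^(1/2)."""
    return math.sqrt(ell / (n_samples * pdf(x)))

def random_error_variance(x, n_samples, ell, pdf):
    """Leading term of Eq. (18): (1/4) / sqrt(N P(x) ell)."""
    return 0.25 / math.sqrt(n_samples * pdf(x) * ell)

for x in (0.0, 1.0, 2.0, 3.0):
    xi = smoothing_length(x, 10_000, 1.0, gaussian_pdf)
    var = random_error_variance(x, 10_000, 1.0, gaussian_pdf)
    print(f"x={x:.1f}  xi={xi:.4f}  var={var:.4f}")
```

Both quantities grow in the tails, where examples are sparse, and the two are tied together by the identity $\langle[\delta\psi]^2\rangle = \xi/(4\ell)$ that follows from the two formulas.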

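The complete semiclassical form of the n-point function is

$$ \langle Q(x_1)Q(x_2)\cdots Q(x_N)\rangle^{(0)} \approx \frac{1}{\ell^N}\,R\,\exp[-S(\phi_{\rm cl};\lambda_{\rm cl} = -iN)] , \qquad (19) $$

where $R$ is the ratio of determinants,

$$ R = \left[\frac{\det\left(-\ell\,\partial_x^2 + N Q_{\rm cl}(x)\right)}{\det\left(-\ell\,\partial_x^2\right)}\right]^{-1/2} . \qquad (20) $$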

This has to be computed a bit carefully: there is no restoring force for fluctuations in $\lambda$, but
these can be removed by fixing the spatially uniform component of $\phi(x)$, which enforces
normalization of $Q(x)$. Since everything is finite in the infrared this does not pose a
problem. Then the computation of the determinants is standard, and we find

$$ R = \exp\left[-\frac{1}{2}\left(\frac{N}{\ell}\right)^{1/2}\int dx\,\sqrt{Q_{\rm cl}(x)}\right] , \qquad (21) $$

where as before we use the limit $N \to \infty$ to simplify the result. It is interesting to
note that $R$ can also be written as $\exp[-(1/2)\int dx\,\xi^{-1}]$, so the fluctuation contribution to
the effective action counts the number of independent "bins" (of size $\sim \xi$) that are used in
describing the function $Q(x)$.
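
The bin-counting interpretation can be made concrete with a short numerical sketch. It assumes that $Q_{\rm cl}$ is replaced by a Gaussian $P(x)$ of width `sigma` (an illustrative choice, as are the parameter values and function names) and integrates $1/\xi(x) = \sqrt{NP(x)/\ell}$ by the midpoint rule. For a Gaussian the integral of $P^{1/2}$ is $(8\pi\sigma^2)^{1/4}$, which gives a closed form to compare against.

```python
import math

def gaussian_pdf(x, sigma):
    """Gaussian density standing in for the classical solution Q_cl(x)."""
    return math.exp(-x * x / (2 * sigma * sigma)) / math.sqrt(2 * math.pi * sigma * sigma)

def count_bins(pdf, n_samples, ell, lo, hi, n_steps=20_000):
    """Midpoint-rule estimate of the integral of 1/xi(x), xi = sqrt(ell / (N P(x)))."""
    dx = (hi - lo) / n_steps
    total = 0.0
    for k in range(n_steps):
        x = lo + (k + 0.5) * dx
        total += math.sqrt(n_samples * pdf(x) / ell) * dx
    return total

sigma, n_samples, ell = 1.0, 10_000, 1.0
bins = count_bins(lambda x: gaussian_pdf(x, sigma), n_samples, ell, -12.0, 12.0)
closed_form = math.sqrt(n_samples / ell) * (8 * math.pi * sigma**2) ** 0.25
log_r = -0.5 * bins  # logarithm of R from Eq. (21)
print(bins, closed_form, log_r)
```

The number of bins grows as $\sqrt{N/\ell}$, so decreasing $\ell$ at fixed $N$ buys resolution at the cost of a larger penalty in the effective action.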

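Putting the factors together, we find that

$$ \langle Q(x_1)Q(x_2)\cdots Q(x_N)\rangle^{(0)} \approx \prod_{i=1}^{N}P(x_i)\,\exp[-F(x_1, x_2, \ldots, x_N)] , \qquad (22) $$

where the correction term $F$ is given by

$$ F(\{x_i\}) = \frac{1}{2}\left(\frac{N}{\ell}\right)^{1/2}\int dx\,P^{1/2}(x)\,e^{-\psi(x)/2} + \frac{\ell}{2}\int dx\,(\partial_x\ln P - \partial_x\psi)^2 + \sum_{i=1}^{N}\psi(x_i) . \qquad (23) $$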

One might worry that $\psi(x)$ is driven by density fluctuations that include delta functions
at points $x_i$, while, when we evaluate $F$, we sum up the values of $\psi(x)$ precisely at these
singular points. In fact, these terms are finite and of the same order of magnitude as the
fluctuation determinant. Similarly, our estimate of the probability distribution from Eq. (3)
is finite even when we ask about $Q(x)$ at the points where we have been given examples.
This is not so surprising: we are in one dimension, where ultraviolet divergences should not
be a problem.

Although our theory is finite in the ultraviolet, we do have an arbitrary length scale $\ell$.
This means that we define, a priori, a scale on which variations of the probability $Q(x)$ are
viewed as "too fast." One would rather let all scales in our estimate of the distribution
$Q(x)$ emerge from the data points themselves. We can restore scale invariance (perhaps
scale indifference is a better term here) by viewing $\ell$ itself as a parameter that needs to be
determined. Thus, as a last step in evaluating the functional integral, we should integrate
over $\ell$, weighted by some prior distribution, $P(\ell)$, for values of this parameter. The hope
is that this integral will be dominated by some scale, $\ell_*$, that is determined primarily by
the structure of $Q(x)$ itself, at least in the large $N$ limit. As long as our a priori knowledge
about $\ell$ can be summarized by a reasonably smooth distribution, then, at large $N$, $\ell_*$ must
be the value that minimizes $F$, since this is the only place where $\ell$ appears with coefficients that grow
as powers of $N$. To see how this works we compute the average value of $F$ and minimize
with respect to $\ell$. The result is



$$ \ell_* \propto N^{1/3} . \qquad (24) $$

Strictly, one should use a particular value of $F$ and not its average, but fluctuations are of
lower order in $N$ and do not change the qualitative result $\ell_* \propto N^{1/3}$.
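
The origin of the $N^{1/3}$ scaling can be seen in a toy version of the minimization. The sketch below keeps only the two competing contributions to the average of $F$, a fluctuation term of the form `a * sqrt(N / ell)` and a kinetic term `b * ell`; the coefficients `a` and `b` stand in for the integrals over $P(x)$ and are set to 1 purely as placeholders. The golden-section search in $\ln\ell$ is an implementation choice, not something taken from the text.

```python
import math

def mean_free_energy(ell, n_samples, a=1.0, b=1.0):
    """Toy <F>(ell): fluctuation term a*sqrt(N/ell) plus kinetic term b*ell."""
    return a * math.sqrt(n_samples / ell) + b * ell

def minimize_over_ell(n_samples, a=1.0, b=1.0, lo=1e-6, hi=1e6, iters=200):
    """Golden-section search in log(ell); the toy <F> is convex in log(ell)."""
    inv_phi = (math.sqrt(5) - 1) / 2
    u_lo, u_hi = math.log(lo), math.log(hi)
    for _ in range(iters):
        u1 = u_hi - inv_phi * (u_hi - u_lo)
        u2 = u_lo + inv_phi * (u_hi - u_lo)
        f1 = mean_free_energy(math.exp(u1), n_samples, a, b)
        f2 = mean_free_energy(math.exp(u2), n_samples, a, b)
        if f1 < f2:
            u_hi = u2  # minimum lies in [u_lo, u2]
        else:
            u_lo = u1  # minimum lies in [u1, u_hi]
    return math.exp(0.5 * (u_lo + u_hi))

ell_small = minimize_over_ell(1e3)
ell_large = minimize_over_ell(1e6)
# Slope of log(ell_*) against log(N) between the two sample sizes.
exponent = math.log(ell_large / ell_small) / math.log(1e6 / 1e3)
print(ell_small, ell_large, exponent)
```

Setting the derivative of the toy expression to zero gives $\ell_* = (a/2b)^{2/3}N^{1/3}$, so the exponent does not depend on the placeholder coefficients. Dropping the fluctuation term (`a = 0`) leaves only the kinetic term, which is minimized by driving $\ell$ toward zero.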

The semiclassical evaluation of the relevant functional integrals thus gives a classical
configuration that smooths the examples on a scale $\xi \propto (\ell/N)^{1/2}$, and the scale $\ell$ is selected
by a competition between the classical kinetic energy or smoothness constraint and the
fluctuation determinant. If the fluctuation effects were ignored, as in maximum likelihood
estimation, $\ell$ would be driven to zero and we would be overly sensitive to the details of
the data points. This parallels the discussion of "Occam factors" in the finite dimensional
case, where the phase space factors from integration over the parameters $\{g_\mu\}$ serve to
discriminate against models with larger numbers of parameters. It is not clear from
the discussion of finite dimensional models, however, whether these factors are sufficiently
powerful to reject models with an infinite number of parameters. Here we see that, even in
an infinite dimensional setting, the fluctuation terms are sufficient to control the estimation
problem and select a model with finite, $N$-dependent, complexity.

Because we are trying to estimate a function, rather than a finite number of parameters,
we must allow ourselves to give a more and more detailed description of the function $Q(x)$ as
we see more examples; this is quantified by the scale $\xi_*$ on which the estimated distribution
is forced to be smooth. With the selection of the optimal $\ell$ from Eq. (24), we see that
$\xi_* \propto N^{-1/3}$. The classical solution converges to the correct answer with a systematic error,
from Eq. (17), that vanishes as $\langle\psi\rangle \propto N^{-2/3}$, while the random errors have a variance [Eq.
(18)] that vanishes with the same power of $N$. We can understand this result by noting
that in a region of size $\xi_*$ there are, on average, $N_{\rm ex} \sim NP(x)\xi_*$ examples, which scales as
$N_{\rm ex} \propto N^{2/3}$; the random errors then have a standard deviation $\delta\psi_{\rm rms} \sim 1/\sqrt{N_{\rm ex}}$.

How does this discussion generalize to higher dimensions? If we keep the simple free field
theory then we will have problems with ultraviolet divergences in the various correlation
functions of the field $\phi$. Because $Q(x) = (1/\ell)\exp[-\phi(x)]$, ultraviolet divergences in
$\phi$ mean that we cannot define a normalizable distribution for the possible values of $Q$ at a
single point in the continuum limit. In terms of information theory, if functions $Q(x)$ are
drawn from a distribution functional with ultraviolet divergences, then even specifying the
function $Q(x)$ to finite precision requires an infinite amount of information.

As an alternative, we can consider higher derivative actions in higher dimensions. All 
the calculations are analogous to those summarized above, so here we only list the results. 
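If we write, in $D$ dimensions, $Q(x) = (1/\ell^D)\exp[-\phi(x)]$ and choose a prior distribution

$$ P[\phi(x)] = \frac{1}{Z}\exp\left[-\frac{\ell^{2\alpha - D}}{2}\int d^Dx\,(\partial_x^\alpha\phi)^2\right]\delta\left[1 - \frac{1}{\ell^D}\int d^Dx\,e^{-\phi(x)}\right] , \qquad (25) $$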



then to ensure finiteness in the ultraviolet we must have $2\alpha > D$. The saddle point equations
lead to a distribution that smooths the examples on a scale $\xi \sim (\ell^{2\alpha - D}/NQ)^{1/2\alpha}$, and the
fluctuation determinant makes a contribution to the action $\propto \int d^Dx\,[NQ(x)/\ell^{2\alpha - D}]^{D/2\alpha}$.
Again we find the optimal value of $\ell$ as a compromise between this term and the kinetic
energy, resulting in $\ell_* \propto N^{D/(4\alpha^2 - D^2)}$. Then the optimal value of $\xi$ becomes $\xi_* \propto N^{-1/(2\alpha + D)}$,
so that the estimated distribution is smooth in volumes of dimension $\xi_*^D$ that contain $N_{\rm ex} \sim
NQ\xi_*^D \sim N^{2\alpha/(2\alpha + D)}$ examples. Then the statistical errors in the estimate will behave as

$$ \delta\psi_{\rm rms} \propto \delta Q/Q \sim N_{\rm ex}^{-1/2} \sim N^{-\mu} , \qquad (26) $$

with the "error exponent" $\mu = \alpha/(2\alpha + D)$. Note that since $2\alpha > D$, the exponent satisfies $1/4 <
\mu < 1/2$. The most rapid convergence, $\mu = 1/2$, occurs if $Q(x)$ is drawn from a family of
arbitrarily smooth ($\alpha \to \infty$) distributions, so we can choose fixed, small bins in which to
accumulate the samples, leading to the naive $1/\sqrt{N}$ counting statistics. If we assume that
our prior distribution functional is local, then $\alpha$ must be an integer and we can have $\mu \to 1/4$
only as $D \to \infty$, so that the slowest possible convergence occurs in infinite dimension.
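
The chain of exponents quoted above can be checked with exact rational arithmetic. The helper below (the name `scaling_exponents` and the dictionary keys are hypothetical) returns the powers of $N$ for $\ell_*$, $\xi_*$, $N_{\rm ex}$ and the error exponent $\mu$, and refuses combinations that violate $2\alpha > D$.

```python
from fractions import Fraction

def scaling_exponents(alpha, dim):
    """Powers of N for ell_*, xi_*, N_ex, and the error exponent mu."""
    if 2 * alpha <= dim:
        raise ValueError("ultraviolet finiteness requires 2 * alpha > D")
    alpha, dim = Fraction(alpha), Fraction(dim)
    return {
        "ell": dim / (4 * alpha**2 - dim**2),   # ell_* ~ N^(D / (4 alpha^2 - D^2))
        "xi": -1 / (2 * alpha + dim),           # xi_*  ~ N^(-1 / (2 alpha + D))
        "n_ex": 2 * alpha / (2 * alpha + dim),  # N_ex  ~ N^(2 alpha / (2 alpha + D))
        "mu": alpha / (2 * alpha + dim),        # error ~ N^(-mu)
    }

for alpha, dim in [(1, 1), (2, 2), (2, 3), (3, 5)]:
    e = scaling_exponents(alpha, dim)
    # Consistency: N_ex ~ N * xi_*^D, and the error is N_ex^(-1/2).
    assert e["n_ex"] == 1 + dim * e["xi"]
    assert e["mu"] == e["n_ex"] / 2
    print(alpha, dim, e)
```

For $\alpha = 1$, $D = 1$ this reproduces the one dimensional results $\ell_* \propto N^{1/3}$, $\xi_* \propto N^{-1/3}$ and $N_{\rm ex} \propto N^{2/3}$.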

The fact that higher dimensional functions are more difficult to learn is often called the
'curse of dimensionality.' We see that this is not just a quantitative problem: unless we
hypothesize that higher dimensional functions are drawn from ensembles with proportionately
higher order notions of smoothness, one would require an infinite amount of information
to specify the function at finite precision. Once we adopt these more stringent smoothness
hypotheses, however, the worst that happens is a reduction in the error exponent $\mu$ by a
factor of two.
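
To put numbers on this statement, the sketch below takes, for each dimension $D$, the smallest integer $\alpha$ with $2\alpha > D$ (the least smooth local prior that is still ultraviolet finite) and asks how many examples would give a 10% relative error if the unknown prefactor in $N^{-\mu}$ is set to 1. The unit prefactor, the 10% target, and the function names are assumptions made only for illustration.

```python
from fractions import Fraction

def minimal_local_alpha(dim):
    """Smallest integer alpha with 2 * alpha > D."""
    return dim // 2 + 1

def error_exponent(alpha, dim):
    """Error exponent mu = alpha / (2 alpha + D)."""
    return Fraction(alpha, 2 * alpha + dim)

def samples_for_error(target, alpha, dim):
    """N such that N**(-mu) equals the target error, with the prefactor set to 1."""
    return target ** (-1 / float(error_exponent(alpha, dim)))

for dim in (1, 2, 3, 10, 100):
    alpha = minimal_local_alpha(dim)
    mu = error_exponent(alpha, dim)
    n_needed = samples_for_error(0.1, alpha, dim)
    print(f"D={dim:3d}  alpha={alpha:2d}  mu={mu}  N for 10% error ~ {n_needed:.3g}")
```

With these minimal choices $\mu$ stays between 1/4 and 1/2 in every dimension, so the required number of examples grows only from about $10^3$ toward $10^4$ rather than exponentially in $D$; the price is the increasingly strong smoothness hypothesis encoded in $\alpha$.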

Is there a more general motivation for the choice of action in Eq. (25)? First, we note
that this action gives the maximum entropy distribution consistent with a fixed value of
$\int d^Dx\,(\partial_x^\alpha\phi)^2$, and by integrating over $\ell$ we integrate over these fixed values. Thus our action
is equivalent to the rather generic assumption that our probability distributions are drawn
from an ensemble in which this "kinetic energy" is finite. Second, addition of a constant
to $\phi(x)$ can be absorbed in a redefinition of $\ell$, and since we integrate over $\ell$ it makes sense
to insist on $\phi \to \phi + {\rm const.}$ as a symmetry. Finally, addition of other terms to the action
cannot change the asymptotic behavior at large $N$ unless these terms are relevant operators
in the ultraviolet. Thus many different priors $P[Q(x)]$ will exhibit the same asymptotic
convergence properties, indexed by a single exponent $\mu(\alpha)$.